Problem Statement¶

Context¶

AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).

A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better-targeted marketing to increase the success ratio.

As a data scientist at AllLife Bank, you have to build a model that will help the marketing department identify potential customers who have a higher probability of purchasing the loan.

Objective¶

To predict whether a liability customer will buy a personal loan, to understand which customer attributes are most significant in driving purchases, and to identify which segment of customers to target more.

Data Dictionary¶

  • ID: Customer ID
  • Age: Customer’s age in completed years
  • Experience: #years of professional experience
  • Income: Annual income of the customer (in thousand dollars)
  • ZIP Code: Home Address ZIP code.
  • Family: the Family size of the customer
  • CCAvg: Average spending on credit cards per month (in thousand dollars)
  • Education: Education Level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
  • Mortgage: Value of house mortgage if any. (in thousand dollars)
  • Personal_Loan: Did this customer accept the personal loan offered in the last campaign? (0: No, 1: Yes)
  • Securities_Account: Does the customer have securities account with the bank? (0: No, 1: Yes)
  • CD_Account: Does the customer have a certificate of deposit (CD) account with the bank? (0: No, 1: Yes)
  • Online: Do customers use internet banking facilities? (0: No, 1: Yes)
  • CreditCard: Does the customer use a credit card issued by any other bank (excluding AllLife Bank)? (0: No, 1: Yes)

Importing necessary libraries¶

In [1]:
# Installing the libraries with the specified version.
!pip install numpy==2.0.2 pandas==2.2.2 matplotlib==3.10.0 seaborn==0.13.2 scikit-learn==1.6.1 sklearn-pandas==2.2.0 -q --user
In [2]:
! pip install sqlalchemy-mate==2.0.0.0 uszipcode==1.0.1
Collecting sqlalchemy-mate==2.0.0.0
  Downloading sqlalchemy_mate-2.0.0.0-py3-none-any.whl.metadata (11 kB)
Collecting uszipcode==1.0.1
  Downloading uszipcode-1.0.1-py2.py3-none-any.whl.metadata (8.9 kB)
Requirement already satisfied: sqlalchemy<3.0.0,>=2.0.0 in /usr/local/lib/python3.12/dist-packages (from sqlalchemy-mate==2.0.0.0) (2.0.44)
Requirement already satisfied: prettytable<4.0.0,>=3.0.0 in /usr/local/lib/python3.12/dist-packages (from sqlalchemy-mate==2.0.0.0) (3.17.0)
Requirement already satisfied: attrs in /usr/local/lib/python3.12/dist-packages (from uszipcode==1.0.1) (25.4.0)
Requirement already satisfied: requests in /usr/local/lib/python3.12/dist-packages (from uszipcode==1.0.1) (2.32.4)
Collecting pathlib-mate (from uszipcode==1.0.1)
  Downloading pathlib_mate-1.3.2-py3-none-any.whl.metadata (8.4 kB)
Collecting atomicwrites (from uszipcode==1.0.1)
  Downloading atomicwrites-1.4.1.tar.gz (14 kB)
  Preparing metadata (setup.py) ... done
Collecting fuzzywuzzy (from uszipcode==1.0.1)
  Downloading fuzzywuzzy-0.18.0-py2.py3-none-any.whl.metadata (4.9 kB)
Collecting haversine>=2.5.0 (from uszipcode==1.0.1)
  Downloading haversine-2.9.0-py2.py3-none-any.whl.metadata (5.8 kB)
Requirement already satisfied: wcwidth in /usr/local/lib/python3.12/dist-packages (from prettytable<4.0.0,>=3.0.0->sqlalchemy-mate==2.0.0.0) (0.2.14)
Requirement already satisfied: greenlet>=1 in /usr/local/lib/python3.12/dist-packages (from sqlalchemy<3.0.0,>=2.0.0->sqlalchemy-mate==2.0.0.0) (3.2.4)
Requirement already satisfied: typing-extensions>=4.6.0 in /usr/local/lib/python3.12/dist-packages (from sqlalchemy<3.0.0,>=2.0.0->sqlalchemy-mate==2.0.0.0) (4.15.0)
Requirement already satisfied: charset_normalizer<4,>=2 in /usr/local/lib/python3.12/dist-packages (from requests->uszipcode==1.0.1) (3.4.4)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.12/dist-packages (from requests->uszipcode==1.0.1) (3.11)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.12/dist-packages (from requests->uszipcode==1.0.1) (2.5.0)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.12/dist-packages (from requests->uszipcode==1.0.1) (2025.11.12)
Downloading sqlalchemy_mate-2.0.0.0-py3-none-any.whl (36 kB)
Downloading uszipcode-1.0.1-py2.py3-none-any.whl (35 kB)
Downloading haversine-2.9.0-py2.py3-none-any.whl (7.7 kB)
Downloading fuzzywuzzy-0.18.0-py2.py3-none-any.whl (18 kB)
Downloading pathlib_mate-1.3.2-py3-none-any.whl (56 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 57.0/57.0 kB 2.3 MB/s eta 0:00:00
Building wheels for collected packages: atomicwrites
  Building wheel for atomicwrites (setup.py) ... done
  Created wheel for atomicwrites: filename=atomicwrites-1.4.1-py2.py3-none-any.whl size=6943 sha256=4602e5bb2a9e7d368a01bfc911b205b0bb994624ec325431df6176ed277aaf25
  Stored in directory: /root/.cache/pip/wheels/6b/37/a4/ae30755673c2d1e07228f13b4491fcaef62438f771d5012d07
Successfully built atomicwrites
Installing collected packages: fuzzywuzzy, pathlib-mate, haversine, atomicwrites, sqlalchemy-mate, uszipcode
Successfully installed atomicwrites-1.4.1 fuzzywuzzy-0.18.0 haversine-2.9.0 pathlib-mate-1.3.2 sqlalchemy-mate-2.0.0.0 uszipcode-1.0.1

Note:

  1. After running the above cell, kindly restart the notebook kernel (for Jupyter Notebook) or runtime (for Google Colab), write the relevant code for the project from the next cell, and run all cells sequentially from the next cell.

  2. On executing the above line of code, you might see a warning regarding package dependencies. This error message can be ignored as the above code ensures that all necessary libraries and their dependencies are maintained to successfully execute the code in this notebook.

In [3]:
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Libraries for scaling numerical features
from sklearn.preprocessing import StandardScaler

# to perform k-means clustering
from sklearn.cluster import KMeans

# to perform silhouette analysis
from sklearn.metrics import silhouette_score

# to perform t-SNE
from sklearn.manifold import TSNE

# Library to split data
from sklearn.model_selection import train_test_split

# To build model for prediction
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

# To get diferent metric scores
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
)

from uszipcode import SearchEngine

# to define a common seed value to be used throughout
seed=42

# to suppress unnecessary warnings
import warnings
warnings.filterwarnings("ignore")
/usr/local/lib/python3.12/dist-packages/fuzzywuzzy/fuzz.py:11: UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning
  warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')

Loading the dataset¶

In [4]:
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
In [5]:
loan_data = pd.read_csv("/content/drive/MyDrive/PGAT-2025/MachineLearning/Project/Loan_Modelling.csv")
In [6]:
data = loan_data.copy()

Data Overview¶

  • Observations
  • Sanity checks

Checking first and last five rows of the dataset¶

In [7]:
# Showing first five rows
data.head()
Out[7]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
0 1 25 1 49 91107 4 1.6 1 0 0 1 0 0 0
1 2 45 19 34 90089 3 1.5 1 0 0 1 0 0 0
2 3 39 15 11 94720 1 1.0 1 0 0 0 0 0 0
3 4 35 9 100 94112 1 2.7 2 0 0 0 0 0 0
4 5 35 8 45 91330 4 1.0 2 0 0 0 0 0 1
In [8]:
# Showing last five rows
data.tail()
Out[8]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
4995 4996 29 3 40 92697 1 1.9 3 0 0 0 0 1 0
4996 4997 30 4 15 92037 4 0.4 1 85 0 0 0 1 0
4997 4998 63 39 24 93023 2 0.3 3 0 0 0 0 0 0
4998 4999 65 40 49 90034 3 0.5 2 0 0 0 0 1 0
4999 5000 28 4 83 92612 3 0.8 1 0 0 0 0 1 1
  • The ZIP code alone is too granular to draw conclusions from, so the feature will be enriched with city and state.

Merging Zip Code enriched info¶

In [9]:
# Converting zip code to string, since it is a categorical feature, and collecting the values into a list
zip_codes = data['ZIPCode'].astype('str').tolist()
In [10]:
# Function to get major city and state from zip code
def get_zip_info(zip_codes):
  # create the search engine once and reuse it for every lookup
  search = SearchEngine(simple_or_comprehensive=SearchEngine.SimpleOrComprehensiveArgEnum.comprehensive)
  # create an empty list
  dataframe_list = []
  # for each zip code, search the zip code info
  for zip_code in zip_codes:
    result = search.by_zipcode(zip_code)

    # if result is not None, get major city and state info
    if result:
      city = result.major_city
      state = result.state_abbr

      # create a list with zip code, major city and state
      row = [zip_code, city, state]
      # append row to the empty list
      dataframe_list.append(row)
  # return list of rows
  return dataframe_list
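Since many customers share the same ZIP code, the number of database lookups can be cut sharply by querying each distinct ZIP only once. A minimal sketch of that pattern, using a hypothetical `lookup` callable (a small dict here) in place of `search.by_zipcode`:

```python
def build_zip_table(zip_codes, lookup):
    # query each distinct ZIP code once, preserving first-seen order
    rows = []
    for z in dict.fromkeys(zip_codes):
        info = lookup(z)  # returns (city, state) or None
        if info is not None:
            rows.append([z, info[0], info[1]])
    return rows

# hypothetical stand-in for the uszipcode database
fake_db = {"91107": ("Pasadena", "CA"), "90089": ("Los Angeles", "CA")}
table = build_zip_table(["91107", "90089", "91107"], fake_db.get)
print(table)  # each ZIP appears once, even though 91107 occurred twice
```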
In [11]:
# Calls the function get_zip_info
zipcode_data = get_zip_info(zip_codes)
Download /root/.uszipcode/comprehensive_db.sqlite from https://github.com/MacHu-GWU/uszipcode-project/releases/download/1.0.1.db/comprehensive_db.sqlite ...
  50.00 MB downloaded ...
  100.00 MB downloaded ...
  150.00 MB downloaded ...
  200.00 MB downloaded ...
  250.00 MB downloaded ...
  300.00 MB downloaded ...
  350.00 MB downloaded ...
  400.00 MB downloaded ...
  Complete!
In [12]:
# Transform the list returned into a dataframe and drop duplicated zip codes
zipcode_dataframe = pd.DataFrame(zipcode_data, columns=['ZIPCode', 'MajorCity', 'State']).drop_duplicates()
In [13]:
# Checking the first five rows
zipcode_dataframe.head()
Out[13]:
ZIPCode MajorCity State
0 91107 Pasadena CA
1 90089 Los Angeles CA
2 94720 Berkeley CA
3 94112 San Francisco CA
4 91330 Northridge CA
In [14]:
# Transforms zip code into string and merge the datasets on ZIPCode
data['ZIPCode'] = data['ZIPCode'].astype('str')
enriched_data = pd.merge(data, zipcode_dataframe, how='left', on='ZIPCode')
In [15]:
# Checking first five rows of the new dataset
enriched_data.head()
Out[15]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard MajorCity State
0 1 25 1 49 91107 4 1.6 1 0 0 1 0 0 0 Pasadena CA
1 2 45 19 34 90089 3 1.5 1 0 0 1 0 0 0 Los Angeles CA
2 3 39 15 11 94720 1 1.0 1 0 0 0 0 0 0 Berkeley CA
3 4 35 9 100 94112 1 2.7 2 0 0 0 0 0 0 San Francisco CA
4 5 35 8 45 91330 4 1.0 2 0 0 0 0 0 1 Northridge CA
In [16]:
# Checking last five rows of the new dataset
enriched_data.tail()
Out[16]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard MajorCity State
4995 4996 29 3 40 92697 1 1.9 3 0 0 0 0 1 0 Irvine CA
4996 4997 30 4 15 92037 4 0.4 1 85 0 0 0 1 0 La Jolla CA
4997 4998 63 39 24 93023 2 0.3 3 0 0 0 0 0 0 Ojai CA
4998 4999 65 40 49 90034 3 0.5 2 0 0 0 0 1 0 Los Angeles CA
4999 5000 28 4 83 92612 3 0.8 1 0 0 0 0 1 1 Irvine CA

Checking the shape of the dataset¶

In [17]:
# Checking the shape of the dataset
data.shape
Out[17]:
(5000, 14)
In [18]:
# Checking the shape of the dataset
enriched_data.shape
Out[18]:
(5000, 16)

The original dataset has 5000 rows and 14 columns. The enriched one (with city and state) has the same number of rows and 16 columns. The following analysis will use only the enriched dataset.

Checking data types in the dataset¶

In [19]:
# Checking data types in the dataset
enriched_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 16 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ID                  5000 non-null   int64  
 1   Age                 5000 non-null   int64  
 2   Experience          5000 non-null   int64  
 3   Income              5000 non-null   int64  
 4   ZIPCode             5000 non-null   object 
 5   Family              5000 non-null   int64  
 6   CCAvg               5000 non-null   float64
 7   Education           5000 non-null   int64  
 8   Mortgage            5000 non-null   int64  
 9   Personal_Loan       5000 non-null   int64  
 10  Securities_Account  5000 non-null   int64  
 11  CD_Account          5000 non-null   int64  
 12  Online              5000 non-null   int64  
 13  CreditCard          5000 non-null   int64  
 14  MajorCity           4966 non-null   object 
 15  State               4966 non-null   object 
dtypes: float64(1), int64(12), object(3)
memory usage: 625.1+ KB

There are the following data types in the dataset:

  • Integer: ID, Age, Experience, Income, Family, Education, Mortgage, Personal Loan, Securities Account, CD Account, Online, Credit Card
  • Float: CCAvg
  • String: ZIP Code (converted to string because it is a code, and to allow merging the datasets), major city, state

Following the same logic as ZIP Code, the ID column will be converted to string, as it is a unique key rather than a numeric quantity.

In [20]:
# Transforming ID into string
enriched_data['ID'] = enriched_data['ID'].astype('str')

Treating missing values¶

In [21]:
# Sum of missing values in each column
enriched_data.isnull().sum()
Out[21]:
0
ID 0
Age 0
Experience 0
Income 0
ZIPCode 0
Family 0
CCAvg 0
Education 0
Mortgage 0
Personal_Loan 0
Securities_Account 0
CD_Account 0
Online 0
CreditCard 0
MajorCity 34
State 34

  • There are 34 missing values in the MajorCity and State columns. Since this number is low, the rows with missing values will be dropped below.
In [22]:
# Dropping rows with missing values in MajorCity and State
enriched_data.dropna(subset=['MajorCity', 'State'], inplace=True)
In [23]:
# Checking if the missing values treatment worked
enriched_data.isnull().sum()
Out[23]:
0
ID 0
Age 0
Experience 0
Income 0
ZIPCode 0
Family 0
CCAvg 0
Education 0
Mortgage 0
Personal_Loan 0
Securities_Account 0
CD_Account 0
Online 0
CreditCard 0
MajorCity 0
State 0

In [24]:
# Checking if the missing values treatment worked
enriched_data.shape
Out[24]:
(4966, 16)
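Rather than checking null counts after the fact, the unmatched ZIP codes can be surfaced directly at merge time with pandas' `indicator` flag. A small self-contained sketch (the ZIP `99999` is made up for illustration):

```python
import pandas as pd

left = pd.DataFrame({"ZIPCode": ["91107", "99999"], "Income": [49, 34]})
right = pd.DataFrame({"ZIPCode": ["91107"], "MajorCity": ["Pasadena"], "State": ["CA"]})

merged = pd.merge(left, right, how="left", on="ZIPCode", indicator=True)
# rows present only in the left frame had no city/state match
unmatched = merged.loc[merged["_merge"] == "left_only", "ZIPCode"].tolist()
print(unmatched)
```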

Treating duplicate values¶

In [25]:
# Sum of duplicated values
enriched_data.duplicated().sum()
Out[25]:
np.int64(0)
In [184]:
# Check for duplicates on customer ID
enriched_data['ID'].duplicated().sum()
Out[184]:
np.int64(0)
  • There are no duplicate values in the dataset

Statistical summary¶

In [26]:
# Statistical summary of the columns
enriched_data.describe().T
Out[26]:
count mean std min 25% 50% 75% max
Age 4966.0 45.353806 11.462759 23.0 35.0 45.0 55.0 67.0
Experience 4966.0 20.119815 11.467874 -3.0 10.0 20.0 30.0 43.0
Income 4966.0 73.827829 46.042272 8.0 39.0 64.0 98.0 224.0
Family 4966.0 2.397100 1.147607 1.0 1.0 2.0 3.0 4.0
CCAvg 4966.0 1.937016 1.743928 0.0 0.7 1.5 2.5 10.0
Education 4966.0 1.880387 0.840197 1.0 1.0 2.0 3.0 3.0
Mortgage 4966.0 56.668747 101.865399 0.0 0.0 0.0 101.0 635.0
Personal_Loan 4966.0 0.096053 0.294694 0.0 0.0 0.0 0.0 1.0
Securities_Account 4966.0 0.104108 0.305431 0.0 0.0 0.0 0.0 1.0
CD_Account 4966.0 0.060814 0.239012 0.0 0.0 0.0 0.0 1.0
Online 4966.0 0.596859 0.490578 0.0 0.0 1.0 1.0 1.0
CreditCard 4966.0 0.293596 0.455455 0.0 0.0 0.0 1.0 1.0
  • The mean of the Age column is about 45 years.
  • The column Experience has negative values, which is not possible. Those values will be treated below.
  • Income varies from 8 to 224 thousand dollars a year.
  • Average monthly credit card spending varies from 0 to 10 thousand dollars, so not every customer uses a credit card.
  • 50% of the customers do not have a mortgage.
In [27]:
# The negative values in Experience are probably data-entry errors and should be positive, so we take the absolute value of the column
enriched_data['Experience'] = enriched_data['Experience'].abs()
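Taking absolute values assumes every negative entry is a sign error; before applying it, it can be worth counting how many rows will actually change. A small sketch on synthetic values:

```python
import pandas as pd

exp = pd.Series([-3, -1, 0, 10, 20])
n_negative = int((exp < 0).sum())  # rows that will be altered
fixed = exp.abs()                  # vectorized equivalent of apply(lambda x: abs(x))
print(n_negative, fixed.tolist())
```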

Exploratory Data Analysis¶

  • EDA is an important part of any project involving data.
  • It is important to investigate and understand the data better before building a model with it.
  • A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
  • A thorough analysis of the data, in addition to the questions mentioned below, should be done.

Questions:

  1. What is the distribution of mortgage attribute? Are there any noticeable patterns or outliers in the distribution?
  2. How many customers have credit cards?
  3. What are the attributes that have a strong correlation with the target attribute (personal loan)?
  4. How does a customer's interest in purchasing a loan vary with their age?
  5. How does a customer's interest in purchasing a loan vary with their education?

Univariate analysis¶

Age¶

In [56]:
# Plots the column distribution with a boxplot and a histogram
sns.histplot(data=enriched_data,x='Age',stat='density')
plt.show()
sns.boxplot(data=enriched_data,x='Age')
plt.show()
  • The Age column varies between 23 and 67 years
  • There are no outliers in the distribution

Experience¶

In [57]:
# Plots the column distribution with a boxplot and a histogram
sns.histplot(data=enriched_data,x='Experience',stat='density')
plt.show()
sns.boxplot(data=enriched_data,x='Experience')
plt.show()
In [58]:
print(enriched_data['Experience'].min())
print(enriched_data['Experience'].max())
0
43
  • The experience varies between 0 and 43 years
  • There are no outliers in the distribution

Income¶

In [59]:
# Plots the column distribution with a boxplot and a histogram
sns.histplot(data=enriched_data,x='Income',stat='density')
plt.show()
sns.boxplot(data=enriched_data,x='Income')
plt.show()
In [60]:
print(enriched_data['Income'].min())
print(enriched_data['Income'].max())
8
224
  • The distribution is skewed to the right
  • There are outliers in the distribution, which is intuitive for monetary values.
  • Annual income varies between 8 and 224 thousand dollars.

Family¶

In [61]:
# Plots a countplot of the feature Family
sns.countplot(data=enriched_data, x='Family')
plt.xticks(rotation=90)
plt.show()
  • The number of customers across family sizes is well balanced, though families of one are the most common.

CCAvg¶

In [62]:
# Plots the column distribution with a boxplot and a histogram
sns.histplot(data=enriched_data,x='CCAvg',stat='density')
plt.show()
sns.boxplot(data=enriched_data,x='CCAvg')
plt.show()
  • The distribution is skewed to the right
  • There are many outliers in the right portion of the boxplot, which is normal for monetary values
  • Average monthly spending varies between 0 and 10 thousand dollars.

Mortgage¶

In [63]:
# Plots the column distribution with a boxplot and a histogram
sns.histplot(data=enriched_data,x='Mortgage',stat='density')
plt.show()
sns.boxplot(data=enriched_data,x='Mortgage')
plt.show()
  • Many customers do not have a mortgage, so most rows are concentrated at zero and the number of apparent outliers is inflated.
In [64]:
# Plots the column distribution with a boxplot and a histogram
sns.histplot(data=enriched_data[enriched_data['Mortgage'] > 0],x='Mortgage',stat='density')
plt.show()
sns.boxplot(data=enriched_data[enriched_data['Mortgage'] > 0],x='Mortgage')
plt.show()
In [65]:
print(enriched_data['Mortgage'].min())
print(enriched_data['Mortgage'].max())
0
635
  • Many customers do not have a mortgage, so we also plotted the distribution of only those who do, to analyze it better.

Distribution of customers who have a mortgage:

  • The distribution is skewed to the right
  • There are many outliers to the right, which makes sense for monetary values
  • Mortgage values range up to 635 thousand dollars.
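The boxplot's whiskers follow the usual 1.5 × IQR rule, and the same cutoff can be computed directly to list the outlying mortgages. A sketch on made-up values (not the project data):

```python
import pandas as pd

mortgage = pd.Series([0, 0, 50, 80, 100, 120, 150, 600])
nonzero = mortgage[mortgage > 0]      # restrict to customers with a mortgage
q1, q3 = nonzero.quantile([0.25, 0.75])
upper = q3 + 1.5 * (q3 - q1)          # standard boxplot whisker bound
outliers = nonzero[nonzero > upper]
print(float(upper), outliers.tolist())
```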

Education¶

In [66]:
# Plots a countplot of the feature Education
sns.countplot(data=enriched_data, x='Education')
plt.xticks(rotation=90)
plt.show()
  • The number of customers across education levels is fairly balanced, with the Undergrad level being the most common.

Personal loan¶

In [67]:
# Plots a countplot of the feature personal loan
sns.countplot(data=enriched_data, x='Personal_Loan')
plt.xticks(rotation=90)
plt.show()
In [68]:
enriched_data['Personal_Loan'].value_counts(normalize=True)
Out[68]:
proportion
Personal_Loan
0 0.903947
1 0.096053

  • Most of the customers do not have a personal loan
  • 9.6% of the customers have personal loans

Securities account¶

In [69]:
# Plots a countplot of the feature securities account
sns.countplot(data=enriched_data, x='Securities_Account')
plt.xticks(rotation=90)
plt.show()
In [70]:
enriched_data['Securities_Account'].value_counts(normalize=True)
Out[70]:
proportion
Securities_Account
0 0.895892
1 0.104108

  • Most of the customers (89.6%) do not have a securities account with the bank.

CD Account¶

In [71]:
# Plots a countplot of the feature CD account
sns.countplot(data=enriched_data, x='CD_Account')
plt.xticks(rotation=90)
plt.show()
In [72]:
enriched_data['CD_Account'].value_counts(normalize=True)
Out[72]:
proportion
CD_Account
0 0.939186
1 0.060814

  • Most customers (93.9%) do not have a certificate of deposit account with the bank

Online¶

In [73]:
# Plots a countplot of the feature online
sns.countplot(data=enriched_data, x='Online')
plt.xticks(rotation=90)
plt.show()
In [74]:
enriched_data['Online'].value_counts(normalize=True)
Out[74]:
proportion
Online
1 0.596859
0 0.403141

  • The majority of customers (60%) use internet banking facilities, but the proportion of those who don't is still significant (40%).

Credit Card¶

In [75]:
# Plots a countplot of the feature credit card
sns.countplot(data=enriched_data, x='CreditCard')
plt.xticks(rotation=90)
plt.show()
In [76]:
# check the number of customers for each value of the feature
enriched_data['CreditCard'].value_counts()
Out[76]:
count
CreditCard
0 3508
1 1458

In [77]:
# check the proportion of customers for every value on the feature
enriched_data['CreditCard'].value_counts(normalize=True)
Out[77]:
proportion
CreditCard
0 0.706404
1 0.293596

  • Approximately 30% of the customers (1,458) have credit cards issued by other banks

Major City¶

In [78]:
# Creates a dataset with the top ten cities in terms of customers
plt.figure(figsize=(12, 8))

ax = sns.barplot(data=enriched_data['MajorCity'].value_counts().reset_index().nlargest(10, 'count'), x='MajorCity', y='count')

# Plots a bar plot of those cities
for container in ax.containers:
    ax.bar_label(container)
plt.xticks(rotation=90)
plt.show()
In [79]:
# checks the concentration of customers in the top ten cities

df_plot = enriched_data.groupby('MajorCity').agg({'State': 'count'}).sort_values(by='State', ascending=False)
df_plot.rename(columns={'State': 'Customers'}, inplace=True)
df_plot['Cumulative Sum'] = df_plot['Customers'].cumsum()
total_customers = df_plot['Customers'].sum()
df_plot['Cumulative Percentage'] = (df_plot['Cumulative Sum'] / total_customers) * 100
df_plot = df_plot.head(10)

fig, ax1 = plt.subplots(figsize=(12, 8))

sns.barplot(x=df_plot.index, y=df_plot['Customers'], ax=ax1)
ax1.tick_params(rotation=90)
ax2 = ax1.twinx()
ax2.plot(ax1.get_xticks(), df_plot['Cumulative Percentage'], color='red')
plt.show()
  • The first graph above shows the top 10 cities by number of customers
  • Los Angeles, San Diego and San Francisco are the top 3 cities by number of customers
  • The second graph shows that the top 10 major cities account for approximately 35% of the customers.

State¶

In [80]:
# Creates a barplot of the feature state and value counts
plt.figure(figsize=(20, 10))

ax = sns.barplot(data=enriched_data['State'].value_counts().reset_index(), x='State', y='count')
for container in ax.containers:
    ax.bar_label(container)
plt.xticks(rotation=90)
plt.show()
  • All the customers live in California

Bivariate and Multivariate Analysis¶

Linear correlation¶

In [81]:
# plots a heatmap of the numeric columns
plt.figure(figsize=(10,5))
sns.heatmap(enriched_data.select_dtypes(include=np.number).corr(),annot=True,cmap='Spectral',vmin=-1,vmax=1)
plt.show()

The following features have the strongest correlations with personal loan:

  • Certificate of deposit account (positive correlation)
  • Average spending on credit cards (positive correlation)
  • Income (positive correlation, the strongest of the three)

Personal loan has virtually no correlation with age and experience. It has a low (but still meaningful) positive correlation with Education, which suggests that customers with higher education levels may tend to take more personal loans.

Some features have strong linear correlations with each other:

  • Personal loan (50%) and average credit card spending (65%) have high positive correlations with income, and with each other (37%)
  • Age and work experience are almost perfectly correlated (99%). This makes intuitive sense: the older the customer, the more work experience they tend to have.
  • Personal loan, having a securities account, and having a credit card issued by another bank have positive correlations with having a certificate of deposit at the bank (32%, 32% and 28%, respectively)
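To read the target's correlations without scanning the whole heatmap, the target column can be pulled out of the correlation matrix and sorted. A self-contained sketch on a tiny synthetic frame (not the project data):

```python
import pandas as pd

df = pd.DataFrame({
    "Income":        [49, 34, 11, 100, 45, 180],
    "CCAvg":         [1.6, 1.5, 1.0, 2.7, 1.0, 8.0],
    "Age":           [25, 45, 39, 35, 35, 50],
    "Personal_Loan": [0, 0, 0, 0, 0, 1],
})

# correlation of every numeric feature with the target, strongest first
corr_with_target = (
    df.corr(numeric_only=True)["Personal_Loan"]
      .drop("Personal_Loan")
      .sort_values(ascending=False)
)
print(corr_with_target)
```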
In [82]:
# plots a pairplot of the numeric columns
sns.pairplot(data=enriched_data.select_dtypes(include=np.number), diag_kind="kde")
plt.show()

The pairplot supports the findings above, showing the almost perfect correlation between age and work experience and the high positive correlations among mortgage, average credit card spending, and income.

Customer behavior across personal loan¶

General adherence¶
In [83]:
# count the customers of each value for personal loan feature
enriched_data['Personal_Loan'].value_counts()
Out[83]:
count
Personal_Loan
0 4489
1 477

In [84]:
# count the customers of each value for personal loan feature
enriched_data['Personal_Loan'].value_counts(normalize=True)
Out[84]:
proportion
Personal_Loan
0 0.903947
1 0.096053

  • 477 customers (9.6%) have personal loans, which means there is a large opportunity to expand this market.
Personal loan behavior in relation to demographic features¶
Age and Experience¶
In [85]:
# creates a scatterplot of age, experience and colored by personal loan
sns.scatterplot(x="Age", y="Experience", hue="Personal_Loan", data=enriched_data)
plt.show()
  • There is an almost perfect correlation between age and experience, but neither shows a clear pattern with personal loan
Education¶
In [86]:
# creates a countplot of the feature colored by personal loan
sns.countplot(data=enriched_data, x='Education', hue='Personal_Loan')
plt.xticks(rotation=90)
plt.show()
In [87]:
enriched_data[enriched_data['Personal_Loan'] == 1]['Education'].value_counts(normalize=True)
Out[87]:
proportion
Education
3 0.427673
2 0.377358
1 0.194969

There seems to be a pattern between having a personal loan and education level:

  • 20% of the customers with personal loans are at the Undergrad level
  • 37% are at the Graduate level
  • 42% are at the Advanced/Professional level

As the education level increases, interest in personal loans tends to increase.
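The per-level proportions above can be computed in one step with a row-normalized crosstab. A sketch on a tiny synthetic frame (not the project data):

```python
import pandas as pd

df = pd.DataFrame({
    "Education":     [1, 1, 2, 2, 3, 3],
    "Personal_Loan": [0, 0, 0, 1, 1, 1],
})

# share of loan takers within each education level
rates = pd.crosstab(df["Education"], df["Personal_Loan"], normalize="index")
print(rates[1].tolist())
```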
Family¶
In [88]:
# creates a boxplot of personal loan and family
sns.boxplot(data=enriched_data,x='Personal_Loan',y='Family')
plt.show()

There seems to be a pattern between having a personal loan and number of members in the family:

  • 75% of customers with personal loans have at least two members in the family
  • 50% of customers with personal loans have at least three members in the family
Major Cities¶
In [89]:
# creates a plot of the proportion of customers with and without personal loan for the top 10 major cities

top_major_cities = enriched_data.groupby(['MajorCity', 'Personal_Loan']).size().nlargest(10).reset_index(name='count')['MajorCity'].unique()

enriched_data_top_major_cities = enriched_data[enriched_data['MajorCity'].isin(top_major_cities)]

count_data = enriched_data_top_major_cities.groupby(['MajorCity', 'Personal_Loan']).size().reset_index(name='count')

pivot_data = count_data.pivot(index='MajorCity', columns='Personal_Loan', values='count')

percent_data = pivot_data.div(pivot_data.sum(axis=1), axis=0) * 100

ax = percent_data.plot(kind='bar', stacked=True, figsize=(15, 8), rot=0)
plt.show()

The major cities with highest percentage of customers with personal loans are: Berkeley, La Jolla, Palo Alto and Stanford

Income¶
In [90]:
# creates a boxplot of personal loan and income
sns.boxplot(data=enriched_data,x='Personal_Loan',y='Income')
plt.show()
  • Customers with personal loans tend to have higher incomes
  • 75% of customers with personal loans have annual income greater than 100 thousand dollars. On the other hand, 75% of customers without personal loans have income lower than 100 thousand dollars.
  • However, there are high-income outliers among customers who don't have personal loans, so there may be a potential market among high-income customers who have not yet taken a personal loan.
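The high-income, no-loan segment suggested above can be sized with a simple boolean filter. A sketch with made-up incomes and an illustrative 100-thousand-dollar cutoff:

```python
import pandas as pd

df = pd.DataFrame({
    "Income":        [49, 130, 180, 95, 160],
    "Personal_Loan": [0, 0, 1, 0, 0],
})

threshold = 100  # illustrative cutoff, in thousand dollars
targets = df[(df["Income"] > threshold) & (df["Personal_Loan"] == 0)]
print(len(targets), sorted(targets["Income"]))
```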
Customer behavior and personal loan behavior¶
Credit Card Behavior¶
In [91]:
# creates a boxplot of personal loan and average credit card spending
sns.boxplot(data=enriched_data,x='Personal_Loan',y='CCAvg')
plt.show()
  • Customers with personal loans tend to have higher average credit card spending
  • 75% of customers with personal loans have average spending greater than 2 thousand dollars a month. On the other hand, 75% of customers without personal loans have average spending lower than 3 thousand dollars a month.
  • As with income, there are outliers of customers with high average spending and no personal loans, which shows potential for expanding personal loans.
In [92]:
enriched_data[enriched_data['Personal_Loan'] == 0]['CreditCard'].value_counts(normalize=True)
Out[92]:
proportion
CreditCard
0 0.706839
1 0.293161

In [93]:
enriched_data[enriched_data['Personal_Loan'] == 1]['CreditCard'].value_counts(normalize=True)
Out[93]:
proportion
CreditCard
0 0.702306
1 0.297694

  • The proportion of customers holding credit cards from other banks is nearly identical among customers with (29.8%) and without (29.3%) personal loans, so the CreditCard feature does not seem related to acquiring personal loans.
Mortgage¶
In [94]:
# creates a boxplot of personal loan and mortgage
sns.boxplot(data=enriched_data,x='Personal_Loan',y='Mortgage')
plt.show()
  • Since most customers don't have any mortgage, the first boxplot supports the weak positive correlation between Personal_Loan and Mortgage seen in the heatmap
In [95]:
sns.boxplot(data=enriched_data[enriched_data['Mortgage'] > 0],x='Personal_Loan',y='Mortgage')
plt.show()
  • However, when we compare the distributions of mortgage holders with and without personal loans, we can see a tendency for customers with higher mortgages to also have personal loans.
  • 50% of customers with personal loans and a mortgage have mortgages greater than 300 thousand dollars. On the other hand, 50% of customers without personal loans but with a mortgage have mortgages lower than 200 thousand dollars.
Securities account¶
In [96]:
# check the proportion of customers with and without personal loans among those without a securities account
enriched_data[enriched_data['Securities_Account'] == 0]['Personal_Loan'].value_counts(normalize=True)
Out[96]:
proportion
Personal_Loan
0 0.906271
1 0.093729

In [97]:
# check the proportion of customers with and without personal loans among those with a securities account
enriched_data[enriched_data['Securities_Account'] == 1]['Personal_Loan'].value_counts(normalize=True)
Out[97]:
proportion
Personal_Loan
0 0.883946
1 0.116054

  • 11.6% of customers with a securities account also have personal loans
  • 9.4% of customers without a securities account have personal loans
  • The proportions of personal loans among customers with and without securities accounts don't differ much
CD Account¶
In [98]:
# check the proportion of customers with and without personal loans among those without a CD account
enriched_data[enriched_data['CD_Account'] == 0]['Personal_Loan'].value_counts(normalize=True)
Out[98]:
proportion
Personal_Loan
0 0.927744
1 0.072256

In [99]:
# check the proportion of customers with and without personal loans among those with a CD account
enriched_data[enriched_data['CD_Account'] == 1]['Personal_Loan'].value_counts(normalize=True)
Out[99]:
proportion
Personal_Loan
0 0.536424
1 0.463576

  • The share of personal loan holders is much higher among customers with certificates of deposit (46%) than among those without (7%)
  • Customers with certificates of deposit are a promising segment for personal loans
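The two filtered value_counts cells above can be collapsed into a single row-normalized crosstab. A minimal sketch with toy data (the values below are illustrative, not the bank's):

```python
import pandas as pd

# toy stand-in for enriched_data; values are illustrative only
df = pd.DataFrame({
    "CD_Account":    [0, 0, 0, 0, 1, 1, 1, 1],
    "Personal_Loan": [0, 0, 0, 1, 0, 1, 1, 1],
})

# normalize="index" gives the personal loan rate within each CD_Account group
rates = pd.crosstab(df["CD_Account"], df["Personal_Loan"], normalize="index")
print(rates)
```

Each row sums to 1, so the column for Personal_Loan = 1 reads directly as the conversion rate per group.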
Internet Banking¶
In [100]:
# check the proportion of customers with and without personal loans among those who don't use internet banking
enriched_data[enriched_data['Online'] == 0]['Personal_Loan'].value_counts(normalize=True)
Out[100]:
proportion
Personal_Loan
0 0.907093
1 0.092907

In [101]:
# check the proportion of customers with and without personal loans among those who use internet banking
enriched_data[enriched_data['Online'] == 1]['Personal_Loan'].value_counts(normalize=True)
Out[101]:
proportion
Personal_Loan
0 0.901822
1 0.098178

  • Customers who do and don't use internet banking have similar proportions of personal loans
Conclusion¶
  • Customers with a more advanced education level are more likely to purchase a personal loan.
  • 75% of customers with personal loans have annual income greater than 100 thousand dollars, while 75% of customers without personal loans have income lower than 100 thousand dollars. However, there are high-income outliers among customers without personal loans, so there might be a potential market among high-income customers who do not yet have personal loans.
  • 75% of customers with personal loans have an average credit card spending greater than 2 thousand dollars a month, while 75% of customers without personal loans spend less than 3 thousand dollars a month. As with income, outliers with high average spending and no personal loans show potential for expanding personal loans.
  • Among customers with a mortgage, the higher the mortgage value, the higher the tendency to have a personal loan
  • Customers with certificate of deposit accounts have a higher chance of purchasing personal loans

Customer behavior across other bank products¶

Credit card¶
In [102]:
# creates a feature has credit card based on the average CC spending
enriched_data['has_credit_card'] = np.where(enriched_data['CCAvg'] > 0, 1, 0)
In [103]:
# counts the number of customers who have and don't have credit cards
enriched_data['has_credit_card'].value_counts()
Out[103]:
count
has_credit_card
1 4860
0 106

In [104]:
# counts the proportion of customers who have and don't have credit cards
enriched_data['has_credit_card'].value_counts(normalize=True)
Out[104]:
proportion
has_credit_card
1 0.978655
0 0.021345

In [105]:
# counts the number of customers for each value of CreditCard
enriched_data['CreditCard'].value_counts()
Out[105]:
count
CreditCard
0 3508
1 1458

  • 4,860 customers (97.9% of the total) have a credit card with our bank (based on CCAvg > 0), and 1,458 customers have credit cards issued by another bank
In [106]:
# plots graphs for different features to analyze credit card spending
top_major_cities = enriched_data.groupby(['MajorCity']).size().nlargest(10).reset_index(name='count')['MajorCity'].unique()

enriched_data_top_major_cities = enriched_data[enriched_data['MajorCity'].isin(top_major_cities)]

fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(20, 15))
sns.scatterplot(data=enriched_data,x='Age',y='CCAvg', ax=axes[0, 0])
sns.scatterplot(data=enriched_data,x='Experience',y='CCAvg', ax=axes[0, 1])
sns.scatterplot(data=enriched_data,x='Income',y='CCAvg', ax=axes[1, 0])
sns.boxplot(data=enriched_data_top_major_cities,x='MajorCity',y='CCAvg', ax=axes[1, 1])
sns.boxplot(data=enriched_data,x='Family',y='CCAvg', ax=axes[2, 0])
sns.boxplot(data=enriched_data,x='Education',y='CCAvg', ax=axes[2, 1])
plt.tight_layout()
plt.show()

When we observe average credit card spending across demographic characteristics, we can conclude:

  • There is no relationship between average spending and age or experience.
  • There is a positive linear relationship between income and average spending
  • There is little change in distributions of average spending between top 10 major cities (chosen by the highest number of customers)
  • Customers with two members in the family tend to have slightly higher average spending
  • Customers with an undergrad level of education tend to have slightly higher average spending
Certificate deposits¶
In [107]:
enriched_data['CD_Account'].value_counts()
Out[107]:
count
CD_Account
0 4664
1 302

In [108]:
enriched_data['CD_Account'].value_counts(normalize=True)
Out[108]:
proportion
CD_Account
0 0.939186
1 0.060814

  • Only 302 customers (6%) have a certificate of deposit with the bank
In [109]:
# plots graphs for different features to analyze CD Account
top_major_cities = enriched_data.groupby(['MajorCity']).size().nlargest(10).reset_index(name='count')['MajorCity'].unique()

enriched_data_top_major_cities = enriched_data[enriched_data['MajorCity'].isin(top_major_cities)]

fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(20, 15))
sns.boxplot(data=enriched_data,x='CD_Account',y='Age', ax=axes[0, 0])
sns.boxplot(data=enriched_data,x='CD_Account',y='Experience', ax=axes[0, 1])
sns.boxplot(data=enriched_data,x='CD_Account',y='Income', ax=axes[1, 0])
sns.countplot(data=enriched_data_top_major_cities,x='MajorCity',hue='CD_Account', ax=axes[1, 1])
sns.countplot(data=enriched_data,x='Family',hue='CD_Account', ax=axes[2, 0])
sns.countplot(data=enriched_data,x='Education',hue='CD_Account', ax=axes[2, 1])
plt.tight_layout()
plt.show()
  • There seems to be no relationship between having a certificate of deposit and age or experience
  • Customers with a certificate of deposit tend to have higher income. However, there are high-income outliers without certificates of deposit.
  • As intuition suggests, cities with more customers tend to have more customers with certificates of deposit
  • There seems to be no relationship between having a certificate of deposit and education or family size
Mortgage¶
In [110]:
enriched_data['has_mortgage'] = np.where(enriched_data['Mortgage'] > 0, 1, 0)
In [111]:
enriched_data['has_mortgage'].value_counts()
Out[111]:
count
has_mortgage
0 3435
1 1531

In [112]:
enriched_data['has_mortgage'].value_counts(normalize=True)
Out[112]:
proportion
has_mortgage
0 0.691704
1 0.308296

  • 1,531 customers (30.8%) have a mortgage
In [113]:
# plots graphs for different features to analyze mortgage

top_major_cities = enriched_data.groupby(['MajorCity']).size().nlargest(10).reset_index(name='count')['MajorCity'].unique()

enriched_data_top_major_cities = enriched_data[enriched_data['MajorCity'].isin(top_major_cities)]

fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(20, 15))
sns.scatterplot(data=enriched_data,x='Age',y='Mortgage', ax=axes[0, 0])
sns.scatterplot(data=enriched_data,x='Experience',y='Mortgage', ax=axes[0, 1])
sns.scatterplot(data=enriched_data,x='Income',y='Mortgage', ax=axes[1, 0])
sns.boxplot(data=enriched_data_top_major_cities,x='MajorCity',y='Mortgage', ax=axes[1, 1])
sns.boxplot(data=enriched_data,x='Family',y='Mortgage', ax=axes[2, 0])
sns.boxplot(data=enriched_data,x='Education',y='Mortgage', ax=axes[2, 1])
plt.tight_layout()
plt.show()
  • There seems to be no relationship between mortgage value and age or experience
  • There seems to be a positive linear correlation between income and mortgage value
  • There are outliers of high mortgage value in every top major city
  • There seems to be no relationship between mortgage value and family size or education. Also, there are high-mortgage outliers in every boxplot in this section.
Securities account¶
In [114]:
enriched_data['Securities_Account'].value_counts()
Out[114]:
count
Securities_Account
0 4449
1 517

In [115]:
enriched_data['Securities_Account'].value_counts(normalize=True)
Out[115]:
proportion
Securities_Account
0 0.895892
1 0.104108

  • 517 customers (10.4%) have a securities account
In [116]:
# plots graphs for different features to analyze securities account

top_major_cities = enriched_data.groupby(['MajorCity']).size().nlargest(10).reset_index(name='count')['MajorCity'].unique()

enriched_data_top_major_cities = enriched_data[enriched_data['MajorCity'].isin(top_major_cities)]

fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(20, 15))
sns.boxplot(data=enriched_data,x='Securities_Account',y='Age', ax=axes[0, 0])
sns.boxplot(data=enriched_data,x='Securities_Account',y='Experience', ax=axes[0, 1])
sns.boxplot(data=enriched_data,x='Securities_Account',y='Income', ax=axes[1, 0])
sns.countplot(data=enriched_data_top_major_cities,x='MajorCity',hue='Securities_Account', ax=axes[1, 1])
sns.countplot(data=enriched_data,x='Family',hue='Securities_Account', ax=axes[2, 0])
sns.countplot(data=enriched_data,x='Education',hue='Securities_Account', ax=axes[2, 1])
plt.tight_layout()
plt.show()
  • There seems to be no relationship between having a securities account and age, experience, or income
  • Supporting intuition, cities with more customers tend to have more customers with securities accounts
  • Customers with 3 family members have a slightly lower probability of having a securities account
  • Customers with an undergrad level of education have a slightly higher likelihood of having a securities account
Conclusion¶
  • The bank product with the highest adoption is the credit card, since roughly 98% of customers have a credit card with our bank
  • The other products each have 10% or lower adoption
  • Age and experience show no relationship with any bank product
  • Income is a relevant feature for most products
  • Some products seem to have a weak relationship with family size or education

Data Preprocessing¶

  • Missing value treatment
  • Feature engineering (if needed)
  • Outlier detection and treatment (if needed)
  • Preparing data for modeling
  • Any other preprocessing steps (if needed)

Missing value treatment¶

In [ ]:
# checking missing values
enriched_data.isnull().sum()
Out[ ]:
0
ID 0
Age 0
Experience 0
Income 0
ZIPCode 0
Family 0
CCAvg 0
Education 0
Mortgage 0
Personal_Loan 0
Securities_Account 0
CD_Account 0
Online 0
CreditCard 0
MajorCity 0
State 0

  • There are no more missing values in the dataset

Feature engineering¶

In [123]:
# copies the dataset to keep the enriched original intact
data_preparation = enriched_data.copy()
In [124]:
# creates features
data_preparation['has_credit_card'] = np.where(data_preparation['CCAvg'] > 0, 1, 0)
data_preparation['has_mortgage'] = np.where(data_preparation['Mortgage'] > 0, 1, 0)
data_preparation['number_of_products'] = data_preparation['Securities_Account'] + data_preparation['CD_Account'] + data_preparation['has_credit_card'] + data_preparation['has_mortgage']
In [125]:
# drop unnecessary features
data_preparation.drop(["Experience", "ID", "ZIPCode", "State"], axis=1, inplace=True)
In [128]:
# create new feature

major_cities = enriched_data['MajorCity'].value_counts(ascending=False).head(10)

top_major_cities = major_cities.index.tolist()

data_preparation['is_in_top_major_city'] = np.where(data_preparation['MajorCity'].isin(top_major_cities), 1, 0)

data_preparation.drop(["MajorCity"], axis=1, inplace=True)
In [129]:
data_preparation.head()
Out[129]:
Age Income Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard has_credit_card has_mortgage number_of_products is_in_top_major_city
0 25 49 4 1.6 1 0 0 1 0 0 0 1 0 2 0
1 45 34 3 1.5 1 0 0 1 0 0 0 1 0 2 1
2 39 11 1 1.0 1 0 0 0 0 0 0 1 0 1 1
3 35 100 1 2.7 2 0 0 0 0 0 0 1 0 1 1
4 35 45 4 1.0 2 0 0 0 0 0 1 1 0 1 0
  • Two new features have been created:
  1. number_of_products: the number of banking products the customer has already purchased (excluding the target variable)
  2. is_in_top_major_city: a boolean flag for whether the customer is from one of the ten largest cities in the database. This transformation avoids creating more than 200 dummy features, since the number of distinct cities is high. Since 35% of the customers are concentrated in the top 10 major cities, this cutoff is a reasonable parameter.
  • The column ID has been dropped because it is a unique key
  • The column State has been dropped because all customers are from California
  • The column ZIPCode has been dropped because it was replaced by MajorCity
  • The column Experience has been dropped because it has a 0.99 correlation with Age, avoiding multicollinearity

Outlier detection and treatment¶

In [138]:
# plot feature distribution for numeric columns to analyze outliers

numeric_columns = data_preparation.select_dtypes(include=np.number).columns.tolist()

plt.figure(figsize=(15, 12))

for i, variable in enumerate(numeric_columns):
    plt.subplot(4, 4, i + 1)
    plt.boxplot(data_preparation[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)

plt.show()
  • There are outliers in the data, but they will not be treated, because they are legitimate values
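The boxplot whiskers above use the 1.5×IQR rule (`whis=1.5`). A small helper (a sketch, not part of the notebook) makes the flagging logic explicit:

```python
import pandas as pd

# IQR rule: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged,
# matching the whis=1.5 whiskers in the boxplots above
def count_iqr_outliers(s: pd.Series) -> int:
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return int(((s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)).sum())

# toy example: the 100 falls above the upper fence and is flagged
print(count_iqr_outliers(pd.Series([1, 2, 3, 4, 5, 100])))  # 1
```

Applying this per numeric column would quantify what the boxplots show visually.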

Preparing data for modeling¶

In [139]:
# separates the dataset into independent variables (X) and the dependent variable (y)
X = data_preparation.drop(["Personal_Loan"], axis=1)
y = data_preparation["Personal_Loan"]
In [140]:
# divide datasets in training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)
In [141]:
# verify the shape of the sets
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
Shape of Training set :  (3476, 14)
Shape of test set :  (1490, 14)
In [142]:
# checks the proportion of classes
y_train.value_counts(normalize=True)
Out[142]:
proportion
Personal_Loan
0 0.905351
1 0.094649

In [143]:
# checks the proportion of classes
y_test.value_counts(normalize=True)
Out[143]:
proportion
Personal_Loan
0 0.900671
1 0.099329

  • The data has been separated into X (independent variables) and y (dependent variable)
  • Then, X and y have been split into a training set and a test set
  • The proportion of classes in y_train and y_test is similar
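The class proportions match here thanks to the shuffle; passing `stratify=y` to `train_test_split` (not used above) would guarantee identical proportions in both splits. A minimal sketch under that assumption:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# stratify=y forces the train and test sets to keep the class ratio exactly
y = np.array([0] * 90 + [1] * 10)  # ~90/10 split, like the target here
X = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)
print(y_tr.mean(), y_te.mean())  # both exactly 0.10
```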

Model Building¶

Model Evaluation Criterion¶

The model can make wrong predictions:

  • False positive (FP): determines a customer will purchase a personal loan, when they won't
  • False negative (FN): determines a customer won't purchase a personal loan, when they will

The consequence of each type of wrong prediction is:

  • FP: targeting a customer who won't purchase the loan is not cost effective
  • FN: failing to target a customer who would purchase the loan reduces the campaign's success ratio.

Since the bank is a rapidly growing business, the goal is to increase the success ratio of the campaign, and marketing cost was not mentioned as an issue, so growth will be prioritized over cost efficiency. Thus, recall must be maximized: minimizing false negatives maximizes the chance of including customers who are likely to purchase the personal loan.
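Recall is TP / (TP + FN), the share of actual buyers the model catches. A toy check (the labels below are made up for illustration):

```python
from sklearn.metrics import confusion_matrix, recall_score

# toy labels: 3 actual buyers, the model misses one (a false negative)
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0]

# sklearn's binary confusion matrix ravels as tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fn)                        # 2 true positives, 1 false negative
print(recall_score(y_true, y_pred))  # 2 / (2 + 1)
```

A false positive (the fifth customer) lowers precision but leaves recall untouched, which is why recall is the right lens when missed buyers are the costly error.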

In [144]:
def model_performance_classification_sklearn(model, test, target):

    # predicting using the test variables
    pred = model.predict(test)

    # calculating performance metrics
    acc = accuracy_score(target, pred)
    recall = recall_score(target, pred)
    precision = precision_score(target, pred)
    f1 = f1_score(target, pred)

    # creating a dataframe of metrics
    df_performance = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
        index=[0],
    )

    return df_performance
In [145]:
def confusion_matrix_sklearn(model, test, target):
    # creating prediction over test set
    y_pred = model.predict(test)

    # creating confusion matrix
    cm = confusion_matrix(target, y_pred)

    # formatting labels
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    # formatting plot
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")

Model Building¶

Decision Tree (default)¶

In [146]:
# fits the model
model_decision_tree_default = DecisionTreeClassifier(random_state=1)
model_decision_tree_default.fit(X_train, y_train)
Out[146]:
DecisionTreeClassifier(random_state=1)
In [147]:
# plots confusion matrix on training set
confusion_matrix_sklearn(model_decision_tree_default, X_train, y_train)
In [148]:
# checks the performance on training set
decision_tree_default_perf_train = model_performance_classification_sklearn(
    model_decision_tree_default, X_train, y_train
)
decision_tree_default_perf_train
Out[148]:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
In [149]:
# plots confusion matrix on test set
confusion_matrix_sklearn(model_decision_tree_default, X_test, y_test)
In [150]:
# checks the performance on test set
decision_tree_default_perf_test = model_performance_classification_sklearn(
    model_decision_tree_default, X_test, y_test
)
decision_tree_default_perf_test
Out[150]:
Accuracy Recall Precision F1
0 0.978523 0.871622 0.908451 0.889655
  • The model performed perfectly on the training set and worse on the test set (recall of 87%). Thus, the default model is overfitting.
  • Measures will be taken to improve the model's performance

Model Performance Improvement¶

Decision tree (Class weight: balanced)¶

In [151]:
# fits the model
decision_tree_balanced = DecisionTreeClassifier(random_state=1, class_weight="balanced")
decision_tree_balanced.fit(X_train, y_train)
Out[151]:
DecisionTreeClassifier(class_weight='balanced', random_state=1)
In [152]:
# plots confusion matrix on training set
confusion_matrix_sklearn(decision_tree_balanced, X_train, y_train)
In [153]:
# checks the performance on training set
decision_tree_balanced_perf_train = model_performance_classification_sklearn(
    decision_tree_balanced, X_train, y_train
)
decision_tree_balanced_perf_train
Out[153]:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
In [154]:
# plots confusion matrix
confusion_matrix_sklearn(decision_tree_balanced, X_test, y_test)
In [155]:
# checks the performance on test set
decision_tree_balanced_perf_test = model_performance_classification_sklearn(
    decision_tree_balanced, X_test, y_test
)
decision_tree_balanced_perf_test
Out[155]:
Accuracy Recall Precision F1
0 0.979866 0.885135 0.909722 0.89726
  • The first attempt at improving the model uses balanced class weights on the target. In both training and test sets, class 0 makes up approximately 90% of the samples, so the dataset is highly imbalanced. class_weight='balanced' adjusts the weight of each class inversely to its frequency.
  • Again, the model performed perfectly on the training set and worse on the test set (recall of 88%). Thus, this approach is also overfitting.
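The weights that class_weight='balanced' applies can be inspected directly. A sketch with a 90/10 split like the training target (the exact counts are illustrative):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# balanced weight for class c is n_samples / (n_classes * n_samples_c)
y = np.array([0] * 90 + [1] * 10)
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))  # minority class weighted ~9x heavier
```

Each minority-class error therefore costs the tree roughly nine times as much impurity as a majority-class error.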

Decision tree (pre-pruning)¶

In [156]:
# define the parameters of the tree
max_depth_values = np.arange(2, 7, 2)
max_leaf_nodes_values = [50, 75, 150, 250]
min_samples_split_values = [10, 30, 50, 70]

best_estimator = None
best_score_diff = float('inf')
best_test_score = 0.0

# iterate over hyperparameters
for max_depth in max_depth_values:
    for max_leaf_nodes in max_leaf_nodes_values:
        for min_samples_split in min_samples_split_values:

            # model
            estimator = DecisionTreeClassifier(
                max_depth=max_depth,
                max_leaf_nodes=max_leaf_nodes,
                min_samples_split=min_samples_split,
                class_weight='balanced',
                random_state=42
            )

            # fitting the model
            estimator.fit(X_train, y_train)

            # predictions
            y_train_pred = estimator.predict(X_train)
            y_test_pred = estimator.predict(X_test)

            # recall metrics
            train_recall_score = recall_score(y_train, y_train_pred)
            test_recall_score = recall_score(y_test, y_test_pred)

            # difference between training and testing scores
            score_diff = abs(train_recall_score - test_recall_score)

            # update best score and estimator if needed
            if (score_diff < best_score_diff) and (test_recall_score > best_test_score):
                best_score_diff = score_diff
                best_test_score = test_recall_score
                best_estimator = estimator

# print the best parameters
print("Best parameters found:")
print(f"Max depth: {best_estimator.max_depth}")
print(f"Max leaf nodes: {best_estimator.max_leaf_nodes}")
print(f"Min samples split: {best_estimator.min_samples_split}")
print(f"Best test recall score: {best_test_score}")
Best parameters found:
Max depth: 2
Max leaf nodes: 50
Min samples split: 10
Best test recall score: 1.0
In [157]:
best_estimator_model = best_estimator

# fitting the best model to the training data
best_estimator_model.fit(X_train, y_train)
Out[157]:
DecisionTreeClassifier(class_weight='balanced', max_depth=np.int64(2),
                       max_leaf_nodes=50, min_samples_split=10,
                       random_state=42)
In [158]:
# plots confusion matrix on training set
confusion_matrix_sklearn(best_estimator_model, X_train, y_train)
In [159]:
# checks the performance on training set
decision_tree_pre_perf_train = model_performance_classification_sklearn(
    best_estimator_model, X_train, y_train
)
decision_tree_pre_perf_train
Out[159]:
Accuracy Recall Precision F1
0 0.795455 1.0 0.316346 0.480643
In [160]:
# plots the confusion matrix on test set
confusion_matrix_sklearn(best_estimator_model, X_test, y_test)
In [161]:
# checks the performance on test set
decision_tree_pre_perf_test = model_performance_classification_sklearn(
    best_estimator_model, X_test, y_test
)
decision_tree_pre_perf_test
Out[161]:
Accuracy Recall Precision F1
0 0.787919 1.0 0.318966 0.48366
  • This time, the model does not seem to be overfitting: accuracy on the training set is 0.79 and accuracy on the test set is 0.78.
  • The test set performs slightly worse on accuracy, but recall is the same (1.0) on both the training and test sets.
  • There are no false negatives, so recall is 1 (148 / (148 + 0) = 1)
  • Since the model is not overfitting, we will interpret the tree.
In [162]:
# plots the tree
feature_names = X_train.columns.to_list()
plt.figure(figsize=(20, 10))

out = tree.plot_tree(
    best_estimator_model,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
In [163]:
# print the tree interpretation
print(tree.export_text(best_estimator_model, feature_names=feature_names, show_weights=True))
|--- Income <= 93.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [1345.33, 0.00] class: 0
|   |--- CCAvg >  2.95
|   |   |--- weights: [62.41, 95.09] class: 1
|--- Income >  93.50
|   |--- Education <= 1.50
|   |   |--- weights: [268.40, 322.24] class: 1
|   |--- Education >  1.50
|   |   |--- weights: [61.85, 1320.67] class: 1

In [164]:
# plot feature importance
importances = best_estimator_model.feature_importances_

feature_importance = (
    pd.DataFrame({"feature": feature_names, "importance": importances})
    .sort_values(by="importance", ascending=False)
    .head(5)
)
sns.barplot(x='feature', y='importance', data=feature_importance)
plt.show()

Decision tree with pre-pruning (interpretation):

  • Recall in the training set and in the test set is 1.0
  • The features considered in the tree are (in order of importance): income, education and average spending on credit card
  • The tree logic is:
  1. The first node is based on income (<= 93.5 thousand dollars)
  2. If the first statement is true, the second node evaluates the average credit card spending
  3. If the first statement is false, the customers are evaluated on their level of education
  • The feature importance matches the exploratory data analysis, where we saw the relevance of these features to personal loans: the higher the income, average credit card spending, and education level, the higher the tendency to purchase a personal loan.
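The three splits above can be restated as plain rules (thresholds copied from the export_text output; this is an illustrative sketch, not a replacement for the fitted model):

```python
# thresholds from the pre-pruned tree: Income <= 93.5, CCAvg > 2.95, Education <= 1.5
def predicts_loan(income: float, ccavg: float, education: int) -> int:
    if income <= 93.5:
        # low income: only high credit card spenders are predicted to buy
        return 1 if ccavg > 2.95 else 0
    # higher income: both education branches of this shallow tree predict class 1,
    # so education shifts class weights without changing the predicted label here
    return 1

print(predicts_loan(income=80, ccavg=1.5, education=1))   # 0
print(predicts_loan(income=120, ccavg=1.0, education=3))  # 1
```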

Decision tree (post-pruning)¶

In [165]:
# calculate clf, ccp_alphas and impurities
clf = DecisionTreeClassifier(random_state=1, class_weight="balanced")
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = abs(path.ccp_alphas), path.impurities
In [166]:
# plots the impurity vs effective alpha
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()
In [167]:
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(
        random_state=1, ccp_alpha=ccp_alpha, class_weight="balanced"
    )
    clf.fit(X_train, y_train)
    clfs.append(clf)
print(
    "Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
        clfs[-1].tree_.node_count, ccp_alphas[-1]
    )
)
Number of nodes in the last tree is: 1 with ccp_alpha: 0.2905331413097635
In [168]:
# plots the number of nodes vs alpha
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]

node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
In [169]:
# for each clf, calculate recall
recall_train = []
for clf in clfs:
    pred_train = clf.predict(X_train)
    values_train = recall_score(y_train, pred_train)
    recall_train.append(values_train)
In [170]:
# for each clf, calculate recall
recall_test = []
for clf in clfs:
    pred_test = clf.predict(X_test)
    values_test = recall_score(y_test, pred_test)
    recall_test.append(values_test)
In [171]:
# get train and test scores for each clf
train_scores = [clf.score(X_train, y_train) for clf in clfs]
test_scores = [clf.score(X_test, y_test) for clf in clfs]
In [172]:
# plots recall vs alpha
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(
    ccp_alphas, recall_train, marker="o", label="train", drawstyle="steps-post",
)
ax.plot(ccp_alphas, recall_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()
In [173]:
# gets the best post pruning model based on the tests
index_best_model = np.argmax(recall_test)
best_post_pruning_model = clfs[index_best_model]
print(best_post_pruning_model)
DecisionTreeClassifier(ccp_alpha=np.float64(0.00501714795532976),
                       class_weight='balanced', random_state=1)
In [174]:
# plots the confusion matrix on training set
confusion_matrix_sklearn(best_post_pruning_model, X_train, y_train)
In [175]:
# checks the performance on training set
decision_tree_post_perf_train = model_performance_classification_sklearn(
    best_post_pruning_model, X_train, y_train
)
decision_tree_post_perf_train
Out[175]:
Accuracy Recall Precision F1
0 0.934983 0.99696 0.593128 0.743764
In [176]:
# plots the confusion matrix on test set
confusion_matrix_sklearn(best_post_pruning_model, X_test, y_test)
[Figure: Confusion matrix on the test set]
In [177]:
# checks the performance on test set
decision_tree_post_test = model_performance_classification_sklearn(
    best_post_pruning_model, X_test, y_test
)
decision_tree_post_test
Out[177]:
Accuracy Recall Precision F1
0 0.940268 0.993243 0.625532 0.767624
  • The post-pruning model generalizes well to unseen data: recall on the test set does not drop relative to the training set.
  • The results on the training and test sets are similar.
  • The confusion matrix confirms the recall, since 147 / (147 + 1) ≈ 0.99.
  • So, the next step is to interpret the tree.
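The recall figure can be checked by hand from the confusion-matrix counts (the 147 and 1 quoted in the bullet above):

```python
# recall = TP / (TP + FN); counts taken from the confusion matrix above
tp, fn = 147, 1
recall = tp / (tp + fn)
print(round(recall, 4))  # 0.9932
```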
In [178]:
# plots the tree interpretation
feature_names = X_train.columns.to_list()
plt.figure(figsize=(20, 10))

out = tree.plot_tree(
    best_post_pruning_model,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
[Figure: Visualization of the post-pruned decision tree]
In [179]:
# print the tree interpretation to compare to the plot
print(tree.export_text(best_post_pruning_model, feature_names=feature_names, show_weights=True))
|--- Income <= 93.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [1345.33, 0.00] class: 0
|   |--- CCAvg >  2.95
|   |   |--- weights: [62.41, 95.09] class: 1
|--- Income >  93.50
|   |--- Education <= 1.50
|   |   |--- Family <= 2.50
|   |   |   |--- Income <= 104.50
|   |   |   |   |--- weights: [23.75, 26.41] class: 1
|   |   |   |--- Income >  104.50
|   |   |   |   |--- weights: [233.61, 0.00] class: 0
|   |   |--- Family >  2.50
|   |   |   |--- weights: [11.05, 295.83] class: 1
|   |--- Education >  1.50
|   |   |--- Income <= 110.50
|   |   |   |--- CCAvg <= 2.45
|   |   |   |   |--- weights: [34.79, 5.28] class: 0
|   |   |   |--- CCAvg >  2.45
|   |   |   |   |--- weights: [8.84, 73.96] class: 1
|   |   |--- Income >  110.50
|   |   |   |--- weights: [18.22, 1241.43] class: 1

In [181]:
# plots the top five features by importance
importances = best_post_pruning_model.feature_importances_

feature_importance = (
    pd.DataFrame({"feature": feature_names, "importance": importances})
    .sort_values(by="importance", ascending=False)
    .head(5)
)
sns.barplot(x="feature", y="importance", data=feature_importance)
plt.show()
[Figure: Top five features by importance]

Decision tree with post-pruning (interpretation):

  • Recall is 0.99 on both the training and the test set.
  • The features used by the tree are (in order of importance): income, family, education, and average credit card spending.
  • The tree logic is:
  1. The root node splits on income (<= 93.5 thousand dollars).
  2. If that condition is true, the next node evaluates average credit card spending.
  3. If it is false, customers are split on education level (education <= 1.5). If true, they are then evaluated on family size and income; if false, on income and average credit card spending.
  • The feature importances are consistent with the exploratory data analysis, where these features were clearly related to the personal-loan variable: the higher the income, average credit card spending, and education level, the higher the tendency to purchase a personal loan. As for family, we saw that 75% of customers with personal loans have at least two family members.
  • Comparing the pre-pruning and post-pruning trees, the post-pruning tree uses more complex decision rules. However, both models selected similar features to make their predictions (except family, which only the post-pruning tree uses).
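To make the printed rules concrete, the pruned tree's decision logic can be mirrored as a plain function (thresholds copied from the tree text above; an illustration of the rules, not the deployed model):

```python
# Hand-coded mirror of the post-pruned tree's rules (thresholds copied from
# export_text above). Income in thousand dollars/year, ccavg in thousand
# dollars/month; returns 1 for a predicted personal-loan purchase.
def predict_loan(income, ccavg, education, family):
    if income <= 93.5:
        return 1 if ccavg > 2.95 else 0
    if education <= 1.5:
        if family <= 2.5:
            return 1 if income <= 104.5 else 0
        return 1  # larger families with income > 93.5 predict purchase
    if income <= 110.5:
        return 1 if ccavg > 2.45 else 0
    return 1  # high income, graduate+ education predicts purchase
```

For example, `predict_loan(120, 1.0, 3, 2)` follows the Income > 93.5, Education > 1.5, Income > 110.5 path and returns 1.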

Model Performance Comparison and Final Model Selection¶

In [182]:
# training performance comparison

models_train_comp_df = pd.concat(
    [
        decision_tree_default_perf_train.T,
        decision_tree_balanced_perf_train.T,
        decision_tree_pre_perf_train.T,
        decision_tree_post_perf_train.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Decision Tree (sklearn default)",
    "Decision Tree with class_weight",
    "Decision Tree (Pre-Pruning)",
    "Decision Tree (Post-Pruning)",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
Out[182]:
Decision Tree (sklearn default) Decision Tree with class_weight Decision Tree (Pre-Pruning) Decision Tree (Post-Pruning)
Accuracy 1.0 1.0 0.795455 0.934983
Recall 1.0 1.0 1.000000 0.996960
Precision 1.0 1.0 0.316346 0.593128
F1 1.0 1.0 0.480643 0.743764
In [183]:
# testing performance comparison

models_test_comp_df = pd.concat(
    [
        decision_tree_default_perf_test.T,
        decision_tree_balanced_perf_test.T,
        decision_tree_pre_perf_test.T,
        decision_tree_post_test.T,
    ],
    axis=1,
)
models_test_comp_df.columns = [
    "Decision Tree (sklearn default)",
    "Decision Tree with class_weight",
    "Decision Tree (Pre-Pruning)",
    "Decision Tree (Post-Pruning)",
]
print("Test set performance comparison:")
models_test_comp_df
Test set performance comparison:
Out[183]:
Decision Tree (sklearn default) Decision Tree with class_weight Decision Tree (Pre-Pruning) Decision Tree (Post-Pruning)
Accuracy 0.978523 0.979866 0.787919 0.940268
Recall 0.871622 0.885135 1.000000 0.993243
Precision 0.908451 0.909722 0.318966 0.625532
F1 0.889655 0.897260 0.483660 0.767624

Conclusion:

  • As we have seen above, the first two models are overfitting.
  • The best models are the Decision Tree (Pre-Pruning) and the Decision Tree (Post-Pruning):
  1. Both performed very well on recall (the metric chosen to minimize false negatives): the pre-pruning model reaches 100% recall and the post-pruning model 99% on both training and test sets.
  2. Since recall is nearly identical for the two models, we evaluate other metrics to choose between them: precision and F1-score. Precision is important to minimize false positives, and the F1-score balances false positives and false negatives, since it is the harmonic mean of precision and recall.
  3. The Decision Tree (Post-Pruning) has much better precision on both the training set (+28 percentage points) and the test set (+31 percentage points) while keeping almost the same recall, so its F1-score is also higher (+28 percentage points on the test set).
  4. In conclusion, the best model is the Decision Tree (Post-Pruning): it maintains the same level of recall (our target metric) with much better precision, and hence a better F1-score.
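As a quick sanity check, the post-pruning model's test F1 can be reproduced from the precision and recall reported in the table above, since F1 is the harmonic mean of the two:

```python
# F1 = 2 * P * R / (P + R); values copied from the test-set comparison table
precision, recall = 0.625532, 0.993243
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.768, matching the reported 0.767624
```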

Actionable Insights and Business Recommendations¶

Recommendations:¶

Potential customers for purchasing a personal loan:¶

  • Customers with higher income, higher average credit card spending, and a more advanced education level are more likely to purchase a personal loan. Family size also matters. Some important notes:
  1. 75% of customers with personal loans have an annual income greater than 100 thousand dollars, while 75% of customers without personal loans have an income below 100 thousand dollars. However, there are high-income outliers among customers without personal loans, so there may be a potential market in high-income customers who do not yet have a personal loan.
  2. The credit card is the bank's primary product. 75% of customers with a personal loan spend more than 2 thousand dollars a month on average, while 75% of customers without a personal loan spend less than 3 thousand dollars a month. As with income, there are outliers with high average spending and no personal loan, which shows potential for expanding personal loans.
  3. 75% of customers with personal loans have at least two members in the family.
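A hypothetical sketch of how the marketing team could pull this target segment from a customer table (toy data; column names assumed to follow the data dictionary):

```python
import pandas as pd

# Toy customer table (columns named after the data dictionary; made-up rows)
customers = pd.DataFrame({
    "Income": [45, 120, 150, 98],    # thousand dollars per year
    "CCAvg": [1.2, 3.5, 0.8, 2.6],   # thousand dollars per month
    "Personal_Loan": [0, 0, 0, 0],
})

# High-income, high-spending customers without a personal loan yet
target = customers[
    (customers["Income"] > 100)
    & (customers["CCAvg"] > 2)
    & (customers["Personal_Loan"] == 0)
]
print(len(target))  # 1 customer qualifies in this toy table
```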

Prediction model:¶

  • The bank should deploy the post-pruned decision tree model.
  • This model has excellent recall (99.3% on the test set), the most important metric to ensure the bank does not leave potential borrowers out of the campaign, maximizing the campaign's success metric.
  • It also has better precision (62% on the test set), so it minimizes the number of targeted customers who will not purchase the loan.
  • Consequently, the model also performs better on the F1-score (76% on the test set).
  • Thus, the success ratio of the campaign will be maximized.
  • See the previous section for a detailed model comparison.

Further steps:¶

  • The deployed model's results should be monitored regularly to ensure continued performance.
  • Further studies should be conducted to personalize the marketing message for different customer segments.
